Abstract: Markov Decision Processes (MDPs) have been
used to formulate many decision-making problems in science
and engineering. The objective is to synthesize the best decision
(action selection) policies to maximize expected rewards
(minimize costs) in a given stochastic dynamical environment.
In many practical scenarios (multi-agent systems,
telecommunication, queuing, etc.), the decision-making problem can
have state constraints that must be satisfied, which leads to
Constrained MDP (CMDP) problems. In the presence of such
state constraints, the optimal policies can be very hard to
characterize. This paper introduces a new approach for finding
non-stationary randomized policies for finite-horizon CMDPs.
An efficient algorithm based on Linear Programming (LP) and
duality theory is proposed, which gives the convex set of all
feasible policies and ensures that the expected total reward
is above a computable lower bound. The resulting decision
policy is a randomized policy, which is the projection of the
unconstrained deterministic MDP policy on this convex set.
To the best of our knowledge, this is the first result in state
constrained MDPs to give an efficient algorithm for generating
finite-horizon randomized policies for CMDPs with optimality
guarantees. A simulation example of a swarm of autonomous
agents running MDPs is also presented to demonstrate the
proposed CMDP solution algorithm.

I. Introduction

Markov Decision Processes (MDPs) have been used to 
formulate many decision-making problems in a variety of 
areas of science and engineering [1]-[3]. MDPs can also be
useful in modeling decision-making problems for stochastic 
dynamical systems where the dynamics cannot be fully 
captured by using first principle formulations. MDP models 
can be constructed by utilizing the available measured data, 
which allows construction of the state transition probabilities.
Hence MDPs play a critical role in big-data analytics.
Indeed, very popular methods of machine learning
such as reinforcement learning and its variants [4] [5] are 
built on the MDP framework. With the increased interest 
and efforts in Cyber-Physical Systems (CPS), there is even 
more interest in MDPs to facilitate rigorous construction of 
innovative hierarchical decision-making architectures, where
the MDP framework can integrate physics-based models with
data-driven models. Such decision architectures can utilize a 
systematic approach to bring physical devices together with 
software to benefit many emerging engineering applications, 
such as autonomous systems. 

In many applications [6] [7], MDP models are used to
compute optimal decisions when future actions contribute to
the overall mission performance. Here we consider MDP-based
sequential stochastic decision-making models [8]. An
MDP model is composed of a set of time epochs, actions, 
states, and immediate rewards/costs. Actions transfer the 
system in a stochastic manner from one state to another 
and rewards are collected based on the actions taken at the 
corresponding states. Hence MDP models provide analytical 
descriptions of stochastic processes with state and action 
spaces, the state transition probabilities as a function of 
actions, and with rewards as a function of the states and 
actions. The objective is to synthesize the best decision 
(action selection) policies to maximize expected rewards 
(minimize costs) for a given MDP. It is well-known that
optimal policies must be stationary deterministic when
the environment is stationary [8] and when there are no
state constraints. We present new results that aim to increase
fidelity of MDPs for decision-making by incorporating a 
general class of state constraints in the MDP models, which 
then lead to randomized action selection policies. 

In this paper, we study the problem of finding non-stationary
randomized policy solutions for finite-horizon
constrained MDPs (CMDPs). We consider a finite state MDP
with randomized action sets. We give an efficient algorithm 
based on Linear Programming (LP) and duality theory of 
convex optimization [9] that optimizes over the convex set of 
all feasible policies and guarantees the expected total reward 
to be above a computable lower bound. Then the proposed 
policy is the projection of the unconstrained MDP policy 
on this convex set. To the best of our knowledge, this is the
first result in state constrained MDP problems that gives an
efficient algorithm for generating finite-horizon randomized
policies for CMDPs with reward/cost guarantees. Another
advantage of the proposed solution is that it is independent of
the initial state of the system. Thus it can be solved offline and
implemented in large-scale multi-agent systems.

II. Related Work

In MDPs, state constraints can be utilized in several ways. 
They can be used to handle multiple design objectives where 
decisions are computed to maximize rewards for one of the 
objectives while guaranteeing the value of the other objective 
to be within a desired value [10]. The constraints can also 
be imposed by the environment (e.g., safety constraints 
imposed by a mission, as in multi-agent autonomous systems
[11], or in telecommunication applications [12]). In these state
constrained MDPs, the calculation of optimal policies can be
much more difficult, so the constraints are usually relaxed
with the hope that the resulting decisions would still provide
feasible solutions. However, in some applications, these
constraints are critical [11], [13]-[15]. Consider an example 
of exploring a disaster area for search and rescue by using 
multiple autonomous vehicles. Suppose agents are running 
MDP policies to explore the area based on a priori knowledge 
of the potential survivors. Due to safety conditions, vehicles 
may not be allowed to visit certain regions often, which 
can impose strict constraints on the probability of having 
a vehicle in such regions at any given time epoch. These 
safety considerations can be formulated as constraints on the 
probability distribution of the agent state, e.g., an inequality 
constraint for MDPs with discrete state and action spaces of 
finite cardinality 

$$B x_t \le d \quad \forall t \ge 0,$$

where $B \in \mathbb{R}^{m \times n}$ is a matrix and $d \in \mathbb{R}^{m}$ is a vector that
describes the safety constraints, $x_t \in \mathbb{R}^{n}$ is the vector whose
elements $x_t[i]$ are the discrete probabilities for
an agent to be in state $i$ at time $t$, and $\le$ denotes element-wise
inequality. This paper aims to incorporate such safety
constraints into MDP formulations, that is, optimal decision
policies are synthesized within the constraints. The state
constraint above also allows a direct relationship with
chance-constrained decision making, e.g., chance constrained motion
planning [16], where state constraints must be satisfied with
a prescribed probability. MDPs with constraints have recently
been applied to path planning in robotics applications [17].
Beyond being able to ensure safety, these constraints can 
provide advancements in machine learning methods [18]- 
[20]. Incorporating the knowledge of physical constraints can 
potentially improve the estimates of MDPs by better utilizing 
the real-time data, i.e., by searching the MDP parameters in 
smaller and better constrained feasible sets. 

Previous research has focused on finding infinite-horizon 
stationary policies for constrained MDPs. Due to the constraints,
the optimal policies might no longer be deterministic
and stationary [21]. [22] gives an example of a transient 
multi-chain MDP with state constraints and shows that the 
Bellman principle fails to hold and the optimal policy is 
not necessarily stationary. In the presence of constraints, 
randomization in the actions can then be necessary for 
obtaining optimal policies [23] [24]. For stationary policies 
to be optimal, specific assumptions on the underlying Markov 
chain are often introduced [25]. Stationary policies for these 
specific models can be found by using algorithms based on 
Linear Programming (LP) [26] or Lagrange multipliers [27]
[28]. However, finding optimal policies in the broader class
of randomized policies for CMDPs can be very expensive
computationally and there is no previously known algorithm
for the general case [10] [30].

III. Preliminaries and Notation 

A. States and Actions 

Let $S = \{1, \ldots, n\}$ be the set of states (note
that $S$ is finite with cardinality $|S| = n$). Let us define
$A_s = \{1, \ldots, p\}$ to be the set of actions available in state
$s$ (without loss of generality the number of actions does not
change with the state, i.e., $|A_s| = p$ for any $s \in S$). We
consider a discrete-time system where actions are taken at
different decision epochs. Let $s_t$ and $a_t$ be respectively the
state and action at the $t$-th decision epoch.

B. Decision Rule and Policy 

We define a decision rule $D_t$ at time $t$ to be the following
randomized function $D_t : S \to A_s$ that defines
for every state $s \in S$ a random variable $D_t(s) \in A_s$
with a probability distribution defined on $\mathcal{P}(A_s)$ as follows:
$q_{D_t(s)}(a) = \mathrm{Prob}[D_t = a \mid s_t = s]$ for any action $a \in A_s$. Let

$$\pi = (D_1, D_2, \ldots, D_{N-1})$$

be the policy for the decision-making process given that
there are $N - 1$ decision epochs. Note that this decision
rule has a Markovian property because it depends only
on the current state. Indeed, this paper considers only
Markovian policies; history-dependent policies [8] are
not considered.

C. Rewards 

Given a state $s \in S$ and action $a \in A_s$, we define the
reward $r_t(s_t, a_t) \in \mathbb{R}$ to be any real number and let $\mathcal{R}$ be
the set having these values. With a little abuse of notation,
we define the expected reward for a given decision rule $D_t$
at time $t$ to be

$$r_t(s) = \mathbb{E}[r_t(s, D_t(s))] = \sum_{a \in A_s} q_{D_t(s)}(a)\, r_t(s, a),$$

and the vector $r_t \in \mathbb{R}^{n}$ to be the vector with the expected
rewards for each state. Since there are $N - 1$ decision epochs,
there are $N$ reward stages and the final stage reward is given
by $r_N(s_N)$ (or the vector $r_N$ whose entries are the final
rewards for each state).

D. State Transitions 

We now define the transition probabilities as follows,
$p_t(j \mid s, k) = \mathrm{Prob}[s_{t+1} = j \mid s_t = s, a_t = k]$, and let $\mathcal{G}$ be
the set of these transition probabilities. Let

$$p_t(j \mid s, d(s)) = \sum_{a \in A_s} q_{D_t(s)}(a)\, p_t(j \mid s, a),$$

then the elements of the transition matrix $M_t \in \mathbb{R}^{n \times n}$ are:

$$M_t[i,j] = \mathrm{Prob}[s_{t+1} = i \mid s_t = j] = p_t(i \mid j, d(j)).$$

Let $x_t[i] = \mathrm{Prob}[s_t = i \mid s_1]$ be the probability of being at
state $i$ at time $t$, and $x_t \in \mathbb{R}^{n}$ be the vector having these
elements. Then the system evolves according to the following
recursive equation: $x_{t+1} = M_t x_t$, $t = 1, \ldots, N-1$.
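
The construction of $M_t$ from a randomized decision rule and the recursion $x_{t+1} = M_t x_t$ can be illustrated with a minimal sketch. The two-state, two-action model below is a hypothetical example, and the names `transition_matrix`, `step`, `Q`, and `G` are illustrative assumptions rather than notation from the paper (here `Q[j][k]` plays the role of $q_{D_t(j)}(k)$ and `G[k][i][j]` the role of $p_t(i \mid j, k)$).

```python
def transition_matrix(Q, G):
    """Return the column-stochastic matrix M with
    M[i][j] = sum_k G[k][i][j] * Q[j][k].

    Q[j][k]    : probability of choosing action k in state j (rows sum to 1).
    G[k][i][j] : probability of moving to state i from state j under action k.
    """
    n, p = len(Q), len(Q[0])
    return [[sum(G[k][i][j] * Q[j][k] for k in range(p)) for j in range(n)]
            for i in range(n)]


def step(M, x):
    """One step of the density recursion x_{t+1} = M_t x_t."""
    n = len(x)
    return [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]


# Hypothetical model: action 0 keeps the state, action 1 switches it.
G = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: identity transition
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: swap the two states
]
Q = [[0.5, 0.5],   # state 0: stay or switch with equal probability
     [1.0, 0.0]]   # state 1: always stay

M = transition_matrix(Q, G)
x1 = [1.0, 0.0]    # all probability mass starts in state 0
x2 = step(M, x1)
print(M, x2)
```

Each column of `M` sums to one, so `x2` remains a probability vector.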

E. Markov Decision Processes (MDPs) 

Let $\gamma \in [0,1]$ be the discount factor, which represents
the importance of a current reward in comparison to future
possible rewards. We will consider $\gamma = 1$ throughout the
paper, but the results are not affected and remain applicable
after a suitable scaling when $\gamma < 1$.


A discrete MDP is a 5-tuple $(S, A_s, \mathcal{G}, \mathcal{R}, \gamma)$ where $S$ is
a finite set of states, $A_s$ is a finite set of actions available for
state $s$, $\mathcal{G}$ is the set that contains the transition probabilities
given the current state and current action, and $\mathcal{R}$ is the set
of rewards at time $t$ due to the current state and the action.


F. Performance Metric 

For a policy to be better than another policy we need
to define a performance metric. We will use the expected
discounted total reward for our performance study,

$$v_N^{\pi} = \mathbb{E}_{x_1}\left[\sum_{t=1}^{N-1} r_t(X_t, D_t(X_t)) + r_N(X_N)\right],$$

where the expectation is conditioned on knowing the probability
distribution of the initial states (i.e., knowing $x_1 \in \mathcal{P}(S)$
where $x_1[i] = \mathrm{Prob}[s_1 = i]$). For example, if the agent
was in state $s$ at $t = 1$, then $x_1 = e_s$ where $e_s$ is a vector
of all zeros except for the $s$-th element, which is equal to 1.
It is worth noting that in the above expression, both $X_t$ and
$D_t$ are random variables.


IV. Optimal Markovian Policy Synthesis Problem

The optimal policy $\pi^*$ is given as the policy that maximizes
the performance measure, $\pi^* = \arg\max_{\pi} v_N^{\pi}$, and
$v_N^*$ denotes the optimal value, i.e., $v_N^* = \max_{\pi} v_N^{\pi}$. Note
that this maximization is unconstrained and the optimization
variables are $q_{D_t(s)}(a)$ for any $s \in S$ and $a \in A_s$.^1 The
backward induction algorithm [8, p. 92] based on dynamic
programming gives the optimal policy as well as the optimal
value by using Algorithm 1.


Algorithm 1 Backward Induction: Unconstrained MDP Optimal Policy

1: Definitions: For any state $s \in S$, we define

   $$V_t^{\pi}(s) = \mathbb{E}\left[\sum_{k=t}^{N-1} r_k(X_k, D_k(X_k)) + r_N(X_N)\right]$$

   and $V_t^{*}(s) = \max_{\pi} V_t^{\pi}(s)$ given that $s_t = s$.

2: Start with $V_N^{*}(s) = r_N(s)$.

3: For $t = N-1, \ldots, 1$, given $V_{t+1}^{*}$, for all $s \in S$ calculate
   the optimal value

   $$V_t^{*}(s) = \max_{a \in A_s} \Big\{ r_t(s,a) + \sum_{j \in S} p_t(j \mid s,a)\, V_{t+1}^{*}(j) \Big\}$$

   and the optimal policy is defined by $q_{D_t(s)}(a) = 1$ for
   $a = a_t^{*}(s)$ given by:

   $$a_t^{*}(s) = \arg\max_{a \in A_s} \Big\{ r_t(s,a) + \sum_{j \in S} p_t(j \mid s,a)\, V_{t+1}^{*}(j) \Big\}.$$

4: Result: $V_1^{*}(s_1) = v_N^{*}$ where $s_1$ is the initial state.


Algorithm 1 solves the optimal policy selection in the
absence of constraints on the expected state vector $x_t$ for
$t = 1, \ldots, N$. Next we introduce state constraints as follows

$$B x_t \le d, \quad \text{for } t = 1, \ldots, N,$$

where $\le$ denotes the element-wise inequalities, and $d$ is
a vector giving upper bounds on the state/transition probabilities.
These state constraints lead to correlations between
decision rules at different states, and the backward induction
algorithm cannot then be used to find an optimal policy when
the state constraints exist. Even finding a feasible policy can
be very challenging. We refer to this problem as Constrained
MDP (CMDP).

The optimal policy synthesis problem with constraints on
$x_t$ can then be written as follows.

$$
\begin{aligned}
\max_{Q_1, \ldots, Q_{N-1}} \quad & v_N^{\pi} \\
\text{s.t.} \quad & B x_t \le d, && \text{for } t = 1, \ldots, N, \\
& Q_t \mathbf{1} = \mathbf{1}, && \text{for } t = 1, \ldots, N-1, \\
& Q_t \ge 0, && \text{for } t = 1, \ldots, N-1,
\end{aligned}
\qquad (1)
$$

where $Q_t \in \mathbb{R}^{n \times p}$ is the matrix of decision variables
$q_t(s,a) = q_{D_t(s)}(a)$. The last two sets of constraints guarantee
that the variables define probability distributions. $B$
is a constant matrix, which is assumed to be the identity
$B = I_n$ (but the following discussion easily extends to any
matrix $B$). Without the first set of constraints, the rows in
$Q_t$ are independent and they are not correlated. With the
added first set of constraints (that are non-convex because
$x_t = M_{t-1} \cdots M_2 M_1 x_1$, where each of the matrices $M_i$
is a linear function of the variables $Q_i$), correlation would
exist between the rows of the $Q_t$'s, and the backward induction
that leverages the independence of the rows of $Q_t$ cannot
be applied as usual. The next section introduces a dynamic
programming based algorithm for the above problem.

^1 Since $v_N^{\pi}$ is continuous in the decision variables that belong to a closed
and bounded set, the max is always attained and the argmax is well
defined.


V. Dynamic Programming (DP) Approach to Markovian Policy Synthesis


In this section, we transform the MDP problem into a
deterministic Dynamic Programming (DP) problem, give the
equivalence with the unconstrained case and discuss how to
solve the (more complicated) state constrained problem. First
note that the performance metric can be written as follows:

$$
\begin{aligned}
v_N^{\pi} &= \mathbb{E}_{x_1}\left[\sum_{t=1}^{N-1} r_t(X_t, D_t(X_t)) + r_N(X_N)\right] \\
&= \sum_{t=1}^{N-1} \mathbb{E}_{x_1}\big[r_t(X_t, D_t(X_t))\big] + \mathbb{E}_{x_1}\big[r_N(X_N)\big] \\
&= \sum_{t=1}^{N} r_t^T\, \mathbb{E}_{x_1}[X_t] = \sum_{t=1}^{N} x_t^T r_t,
\end{aligned}
$$

where the last equality utilized the fact that $\mathbb{E}_{x_1}[X_t] = x_t$.
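
As a small numerical illustration of the identity $v_N^{\pi} = \sum_{t=1}^{N} x_t^T r_t$, the sketch below accumulates the stage rewards along the density trajectory. The function name `expected_total_reward` and the two-state numbers are hypothetical; the transition matrices and the expected reward vectors $r_t$ are assumed to be already computed for a fixed policy.

```python
def expected_total_reward(x1, M_list, r_list):
    """Return sum_{t=1}^{N} x_t^T r_t, where x_{t+1} = M_t x_t.

    M_list holds the N-1 transition matrices M_1, ..., M_{N-1};
    r_list holds the N expected reward vectors r_1, ..., r_N.
    """
    x = list(x1)
    n = len(x)
    total = 0.0
    for t, r in enumerate(r_list):
        total += sum(xi * ri for xi, ri in zip(x, r))
        if t < len(M_list):  # propagate the density to the next epoch
            M = M_list[t]
            x = [sum(M[i][j] * x[j] for j in range(n)) for i in range(n)]
    return total


# Hypothetical data: N = 3 stages, stationary dynamics and rewards.
M = [[0.5, 0.0],
     [0.5, 1.0]]
r = [1.0, 3.0]
r_final = [0.0, 10.0]
v = expected_total_reward([1.0, 0.0], [M, M], [r, r, r_final])
print(v)
```

With these numbers the densities are $x_1 = (1, 0)$, $x_2 = (0.5, 0.5)$, $x_3 = (0.25, 0.75)$, so the stage contributions are 1, 2 and 7.5.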

Next we present the DP formulation. Let $x_t$ be an
element of the extended state space $\mathcal{S} = \mathcal{P}(S)$ (where $\mathcal{P}(\cdot)$
denotes the probability space). The discrete-time dynamical
system describing the evolution of the density $x_t$ can then
be given by

$$x_{t+1} = f_t(x_t, Q_t) \quad \text{for } t = 1, \ldots, N-1,$$

such that $f_t(x_t, Q_t) = M_t(Q_t)\, x_t$ where $M_t(Q_t)$ is a column
stochastic matrix linear in the optimization variables $Q_t$. It is
important to note that the elements of the $i$-th column in $M_t$
are linear functions of only the elements in the $i$-th row of
$Q_t$, not all elements of $Q_t$. The above dynamics show that the
probability distribution evolves deterministically. Our policy
$\pi = (D_1, \ldots, D_{N-1})$ consists of a sequence of functions
that map states $x_t$ into controls $Q_t = D_t(x_t)$ such that
$D_t(x_t) \in C(x_t)$ where $C(x_t)$ is the set of feasible controls.
In the absence of the constraints ($B x_t \le d$), $C(x_t)$ is
independent of $x_t$ and all admissible controls belong to the
same convex set $C$ for all states.

The additive reward per stage is defined as $g_N(x_N) = x_N^T r_N$ and

$$g_t(x_t, Q_t) = x_t^T r_t, \quad \text{for } t = 1, \ldots, N-1.$$

The DP algorithm calculates the optimal value $v_N^*$ (and
policy $\pi^*$) as follows [31, Proposition 1.3.1, p. 23]:


Algorithm 2 Dynamic Programming (DP)

1: Start with $J_N(x) = g_N(x)$.

2: For $t = N-1, \ldots, 1$:

   $$J_t(x) = \max_{Q_t \in C(x)} \big\{ g_t(x, Q_t) + J_{t+1}(f_t(x, Q_t)) \big\}.$$

3: Result: $J_1(x_1) = v_N^*$.


Remark. There are several difficulties in applying the DP
Algorithm 2. Note that in the expression $J_{t+1}(f_t(x, Q_t))$
used in the algorithm, $Q_t$ is an optimization variable. For a
given $Q_t$ and $x$, numerical methods can be used to compute
the value of $J_{t+1}$. But since $Q_t$ itself is an optimization
variable, the solution of the optimization problem in line 2
of Algorithm 2 can be very hard. In some special cases,
for example when $J_t(x)$ can be expressed analytically in
a closed form, the solution complexity can be reduced
significantly, as in the unconstrained MDP problems. □

A. Solving Unconstrained MDPs by DP 

Here we use the DP algorithm to derive well-known results
on optimal MDP policies [8]. Even though the DP approach
is not new (i.e., the algorithm itself is derived from the theory of
operators), its application to finite-state MDPs will provide
new insights. Specifically, when finite-state MDPs are subject
to state constraints, the existing theory cannot be readily
applied. In that case, we show that the DP algorithm can
still provide useful results. Therefore, the purpose of this
section is to apply the DP algorithm for unconstrained MDPs
to obtain well-known results, and set up its use for more
complicated finite-state constrained MDP problems.

Next we present the closed-form solution of the unconstrained
MDPs via Algorithm 1. In the absence of state
constraints, the set of admissible actions at time $t$, denoted
by $C(x_t) = C$, can be described as

$$Q_t \mathbf{1} = \mathbf{1}, \quad \text{and} \quad Q_t \ge 0.$$

Note that each row of $Q_t$ represents the action choice
probabilities for a given state, i.e., $Q_t^i \in C^i$ for $i = 1, \ldots, n$
where $Q_t^i$ is the $i$-th row in $Q_t$ and $C^i$ is the set of row
vectors of probabilities having dimension $|A|$. We can now
apply the DP Algorithm by letting $J_N(x) = x^T r_N$, and
iterating backward from $t = N-1$ to $t = 1$ as follows

$$
\begin{aligned}
J_t(x) &= \max_{Q_t \in C} \big\{ x^T r_t + J_{t+1}(M_t x) \big\} \\
&= \max_{Q_t \in C} \big\{ x^T r_t + x^T M_t^T V_{t+1}^{*} \big\} \\
&= \max_{Q_t \in C} \Big\{ \sum_i x[i] \big( r_t(i, Q_t^i) + M_t^{T,i}(Q_t^i)\, V_{t+1}^{*} \big) \Big\} \\
&= \sum_i x[i] \Big( \max_{Q_t^i \in C^i} \big\{ r_t(i, Q_t^i) + M_t^{T,i}(Q_t^i)\, V_{t+1}^{*} \big\} \Big),
\end{aligned}
$$

where $V_{t+1}^{*}$ is the optimal value function computed by
Algorithm 1 and $M_t^{T,i}$ indicates the transpose of the $i$-th
column of $M_t$, which is a linear function of the variables
in the $i$-th row of $Q_t$. The last equality is due to the fact that
$x[i] \ge 0$ for all $i$ and the value function is separable in terms
of the optimization variables. The maximization inside the
parenthesis in the last equation gives $V_t^{*}[i]$ and hence

$$J_t(x) = \sum_i x[i]\, V_t^{*}[i] = x^T V_t^{*},$$

which has a closed-form solution as a function of $x$. This
discussion justifies that the calculation of $V_t^{*}$ for $t = N, \ldots, 1$
in Algorithm 1 is sufficient for finding the optimal value of
the MDP given by $v_N^{*} = J_1(x_1) = x_1^T V_1^{*}$.

Remark. $V^*$ obtained via Algorithm 1 leads to a deterministic
Markovian policy, which also defines an optimal policy
for the unconstrained MDP, i.e., the policy that maximizes
the total expected reward. It must be emphasized that, if
state constraints were present, then Algorithm 1 does not
necessarily yield an optimal, or even a feasible, policy. □
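
For concreteness, the backward induction of Algorithm 1 can be sketched in a few lines. This is only an illustrative sketch under simplifying assumptions not made in the paper: rewards and transition probabilities are taken to be time-invariant, and the names `backward_induction`, `r`, `p`, and the two-state model are hypothetical.

```python
def backward_induction(r, r_final, p, N):
    """Unconstrained finite-horizon backward induction (sketch of Algorithm 1).

    r[s][a]    : reward for taking action a in state s (time-invariant here).
    r_final[s] : final-stage reward r_N(s).
    p[a][j][s] : Prob[next state = j | state = s, action = a].
    Returns (V_1, policy), where policy[t][s] is the optimal action index
    at decision epoch t + 1 in state s.
    """
    n, num_actions = len(r), len(r[0])
    V = list(r_final)                      # V_N^* = r_N
    policy = []
    for _ in range(N - 1):                 # t = N-1, ..., 1
        V_new, actions = [], []
        for s in range(n):
            q = [r[s][a] + sum(p[a][j][s] * V[j] for j in range(n))
                 for a in range(num_actions)]
            best = max(range(num_actions), key=q.__getitem__)
            actions.append(best)
            V_new.append(q[best])
        V = V_new
        policy.insert(0, actions)          # epochs end up in forward order
    return V, policy


# Hypothetical model: action 0 stays, action 1 switches state.
p = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
r = [[1.0, 1.0], [3.0, 3.0]]               # reward independent of the action
r_final = [0.0, 10.0]
V1, policy = backward_induction(r, r_final, p, N=3)
print(V1, policy)
```

The returned policy is deterministic (one action index per state and epoch), in line with the remark above; nothing in this recursion looks at the density $x_t$, which is why it cannot account for the state constraints.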


B. Solving State Constrained MDPs by DP 

When the state constraints are present, $J_t(x)$ does not have
a closed-form solution, and hence finding an optimal (even
a feasible) solution is challenging. This section presents a
new algorithm, Algorithm 3, to compute a feasible solution
of the state constrained MDPs with lower bound guarantees
on the expected reward.

Theorem 1. Algorithm 3 provides a feasible policy for the
state constrained MDP that guarantees the expected total
reward to be greater than a lower bound $R^* = x_1^T U_1$, i.e.,
$v_N^{\pi} \ge R^*$.

Algorithm 3 Backward Induction: State constrained MDP

1: Definitions: $Q_t$ for $t = 1, \ldots, N-1$ are the optimization
   variables, describing the decision policy. Let
   $X = \{x \in \mathbb{R}^{n} : 0 \le x \le d,\ \mathbf{1}^T x = 1\}$ and $C = \bigcap_{x \in X} C(x)$ with

   $$C(x) = \{Q \in \mathbb{R}^{n \times p} : Q\mathbf{1} = \mathbf{1},\ Q \ge 0,\ M(Q)\,x \le d\},$$

   where $M(Q)$ is the transition matrix linear in $Q$.

2: Set $U_N = r_N$.

3: For $t = N-1, \ldots, 1$, given $U_{t+1}$, compute the policy

   $$\hat{Q}_t = \arg\max_{Q \in C}\ \min_{x \in X}\ x^T \big( r_t(Q) + M_t(Q)^T U_{t+1} \big), \qquad (2)$$

   and the vector of expected rewards

   $$U_t = r_t(\hat{Q}_t) + M_t(\hat{Q}_t)^T U_{t+1}.$$

4: Result: $v_N^{\pi} \ge x_1^T U_1$.


Proof. The proof is based on applying the DP Algorithm 2.
Letting $J_N(x) = x^T r_N$, it will be shown by induction that
$J_t(x) \ge x^T U_t$. It is true for $t = N$. Now supposing that
it is true for $t+1$, let us prove it true for $t$. We have from
Algorithm 2 that

$$
\begin{aligned}
J_t(x) &= \max_{Q_t \in C(x)} \big\{ g_t(x, Q_t) + J_{t+1}(f_t(x, Q_t)) \big\} \\
&= \max_{Q_t \in C(x)} \big\{ x^T r_t + J_{t+1}(M_t x) \big\} \\
&\ge \max_{Q_t \in C(x)} \big\{ x^T r_t + x^T M_t^T U_{t+1} \big\} \\
&\ge \max_{Q_t \in C} \big\{ x^T ( r_t + M_t^T U_{t+1} ) \big\} \\
&\ge x^T U_t,
\end{aligned}
$$

where in the third line we applied the induction hypothesis
and the last line follows by the definition of $\hat{Q}_t$. Since $J_1(x_1) = v_N^*$
(line 3 in Algorithm 2), this ends the proof. □

Therefore, by calculating $U_t$ for $t = N-1, \ldots, 1$ from
Algorithm 3 we can find a policy that gives minimum
guarantees on the total expected reward, namely $v_N^{\pi} \ge R^*$,
where $\pi = (\hat{Q}_1, \hat{Q}_2, \ldots, \hat{Q}_{N-1})$.
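
The backward recursion $U_t = r_t(\hat{Q}_t) + M_t(\hat{Q}_t)^T U_{t+1}$ in Step 3 is simple to evaluate once the policies are available. The sketch below assumes the matrices $\hat{Q}_t$ are already given (it does not perform the max-min step (2)) and uses time-invariant transition data; `lower_bound_vector`, `Q_hat`, `R`, and `G` are hypothetical names chosen for this illustration.

```python
def lower_bound_vector(Q_list, R_list, r_final, G):
    """Evaluate U_1 from U_N = r_N and U_t = r_t(Q_t) + M_t(Q_t)^T U_{t+1}.

    Q_list[t][s][a] : probability of action a in state s at epoch t + 1.
    R_list[t][s][a] : reward r_t(s, a).
    G[k][i][j]      : Prob[next state = i | state = j, action = k]
                      (assumed time-invariant in this sketch).
    """
    U = list(r_final)
    for Q, R in zip(reversed(Q_list), reversed(R_list)):
        n, p = len(Q), len(Q[0])
        # Expected stage reward per state: r_t(Q)[s] = sum_a R[s][a] * Q[s][a].
        r = [sum(R[s][a] * Q[s][a] for a in range(p)) for s in range(n)]
        # (M^T U)[j] = sum_i M[i][j] * U[i], with M[i][j] = sum_k G[k][i][j] * Q[j][k].
        MtU = [sum(G[k][i][j] * Q[j][k] * U[i]
                   for i in range(n) for k in range(p))
               for j in range(n)]
        U = [r[j] + MtU[j] for j in range(n)]
    return U


# Hypothetical data: two states, two actions (0 = stay, 1 = switch), N = 3.
G = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
Q_hat = [[0.5, 0.5], [1.0, 0.0]]
R = [[1.0, 1.0], [3.0, 3.0]]
U1 = lower_bound_vector([Q_hat, Q_hat], [R, R], [0.0, 10.0], G)
x1 = [1.0, 0.0]
bound = sum(xi * ui for xi, ui in zip(x1, U1))
print(U1, bound)
```

Because $U_1$ does not depend on $x_1$, the same vector yields the bound $x_1^T U_1$ for every initial distribution.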

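C. LP formulation of $\hat{Q}_t$

The calculation of $\hat{Q}_t$ for $t = 1, \ldots, N-1$ in equation
(2) is the main challenge in the application of Algorithm 3.
This section describes a linear programming approach to the
computation of $\hat{Q}_t$ in every iteration loop of Step 3
in Algorithm 3. From the algorithm, $\hat{Q}_t$ is given as follows:

$$
\begin{aligned}
& \hat{Q}_t = \arg\max_{Q_t \in C}\ \min_{x \in X}\ x^T U_t(Q_t), \\
& \text{where } U_t(Q_t) = r_t(Q_t) + M_t(Q_t)^T U_{t+1} \\
& \text{with } U_N = r_N \text{ and } U_t = U_t(\hat{Q}_t),\ t = N-1, \ldots, 1.
\end{aligned}
\qquad (3)
$$

Let $R_t \in \mathbb{R}^{n \times p}$ be the matrix having the elements $r_t(s,a)$
where $s \in S$ and $a \in A$. Let $G_{t,k}$ be the transition matrix
having the elements $G_{t,k}[i,j] = p_t(i \mid j, k)$. Then we have

$$M_t(Q_t) = \sum_k G_{t,k} \odot \big(\mathbf{1}(Q_t e_k)^T\big) \quad \text{and} \quad r_t(Q_t) = (R_t \odot Q_t)\mathbf{1}, \qquad (4)$$

where $\odot$ denotes the element-wise product, $\mathbf{1}$ the vector of
all ones, and $e_k$ the $k$-th unit vector.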

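Theorem 2. The max-min problem given by (3) with (4) can
be solved by the following equivalent linear programming
problem (given $t$, $d$, $G_{t,k}$ for $k = 1, \ldots, p$, $R_t$, and $U_{t+1}$):

$$
\begin{aligned}
\underset{Q,\,y,\,z,\,r,\,M,\,S,\,s,\,K}{\text{maximize}} \quad & -d^T y + z \\
\text{subject to} \quad & M = \sum_{k=1}^{p} G_{t,k} \odot \big(\mathbf{1}(Q e_k)^T\big) \\
& r = (R_t \odot Q)\mathbf{1} \\
& -y + z\mathbf{1} \le r + M^T U_{t+1} \\
& K = M + S + s\mathbf{1}^T \\
& s + d \ge K d \\
& Q\mathbf{1} = \mathbf{1},\ Q \ge 0,\ y \ge 0,\ S \ge 0,\ K \ge 0.
\end{aligned}
\qquad (5)
$$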

Proof. The proof will use the duality theory of linear programming
[9], which implies that the following primal and
dual problems produce the same cost:

PRIMAL: minimize $b^T x$ s.t. $A^T x \ge c$, $x \ge 0$.

DUAL: maximize $c^T y$ s.t. $A y \le b$, $y \ge 0$.

Since the set $X$ is defined as $0 \le x \le d$ and $x^T \mathbf{1} = 1$,
the min in (3) can be obtained by a minimization problem
with the following primal problem parameters:

$$b = U_t, \quad A = [\,-I_n \ \ \mathbf{1} \ \ -\mathbf{1}\,], \quad \text{and} \quad c^T = [\,-d^T \ \ 1 \ \ -1\,].$$

The dual of this program is

$$
\begin{aligned}
\text{maximize} \quad & -d^T y + z \\
\text{subject to} \quad & -y + z\mathbf{1} \le U_t(Q_t) \\
& y \ge 0,\ z \text{ unconstrained}.
\end{aligned}
$$

Next, by considering the argmax in $Q$, it remains to show
that the set $C$ can be represented by linear inequalities to
write (3) as a maximization LP problem. It is indeed the
case by using [11, Theorem 1], which says the following:

$$
M \in \mathcal{M} \iff \exists\, S \ge 0,\ K \ge 0,\ s \ \text{such that} \quad
K = M + S + s\mathbf{1}^T, \quad s + d \ge K d,
$$

where $\mathcal{M} = \bigcap_{x \in X} \mathcal{M}(x)$ and
$\mathcal{M}(x) = \{M \in \mathbb{R}^{n \times n} : \mathbf{1}^T M = \mathbf{1}^T,\ M \ge 0,\ M x \le d\}$.
As $M$ in (4) is a linear function of
the decision variable $Q$, the set $C$ is equivalently described
by $C = \{Q \in \mathbb{R}^{n \times p} : Q\mathbf{1} = \mathbf{1},\ Q \ge 0,\ M(Q) \in \mathcal{M}\}$, which
implies that $C$ can be described by linear inequalities. Now
combining this result with the dual program, we can conclude
that $\hat{Q}_t$ can be obtained via the linear program given in the
theorem, which concludes the proof. □
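
The primal-dual pair used in the proof can be checked on a small instance. For a fixed vector $U$, the inner problem $\min_{x \in X} x^T U$ is a linear program over a box with a single equality constraint, so it admits a greedy solution: place as much mass as the capacities $d$ allow on the smallest entries of $U$. The sketch below uses this greedy rule (a standard fact about such LPs, not a step taken in the paper) together with a hand-picked dual certificate; `worst_case_value` and the numbers are hypothetical.

```python
def worst_case_value(U, d):
    """Solve min x^T U subject to 0 <= x <= d and sum(x) = 1 greedily.

    Mass is assigned to the smallest entries of U first, up to capacity d[i].
    Returns (optimal value, minimizing x).
    """
    if sum(d) < 1.0:
        raise ValueError("infeasible: capacities sum to less than 1")
    x = [0.0] * len(U)
    remaining = 1.0
    for i in sorted(range(len(U)), key=U.__getitem__):
        x[i] = min(d[i], remaining)
        remaining -= x[i]
        if remaining <= 0.0:
            break
    return sum(xi * ui for xi, ui in zip(x, U)), x


# Hypothetical data for illustration.
U = [10.5, 16.0, 4.0]
d = [0.5, 1.0, 0.25]
value, x_min = worst_case_value(U, d)

# Dual certificate: z is the entry of U that receives the last unit of mass,
# and y_i = max(0, z - U_i) keeps the dual constraint -y + z*1 <= U feasible.
z = 16.0
y = [max(0.0, z - u) for u in U]
dual_value = z - sum(di * yi for di, yi in zip(d, y))
print(value, x_min, dual_value)
```

The primal and dual values coincide here, as LP duality requires; in the linear program (5), the dual variables $y$ and $z$ are optimized jointly with $Q$ rather than fixed by hand.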

D. Better Heuristic for Qt 

The resulting solution $\hat{Q}_t$ of linear program (5) is in general not
unique. Let the convex solution set be $\mathcal{Q}$. Among
the possible solution variables $Q \in \mathcal{Q}$, we are interested in
values that are as close as possible to $Q_{MDP}$ (found by
Algorithm 1) because the latter gives the optimal policy
for the unconstrained MDP problem. But since $Q_{MDP}$ might not
be feasible (due to the additional constraints), we target the
projection of $Q_{MDP}$ on $\mathcal{Q}$. Therefore, we choose $\hat{Q}_t$ to be
the solution of the following optimization:

$$\hat{Q}_t = \arg\min_{Q \in \mathcal{Q}} \|Q - Q_{MDP}\|.$$

Note that if $Q_{MDP} \in \mathcal{Q}$, then this optimization will give the
optimal policy. Therefore, with this extra optimization, the
output policy not only guarantees a lower bound on the expected
reward, but it also retrieves the solution of the
unconstrained MDP if the state constraints are relaxed.

VI. Simulations 

This section presents a simulation example to demonstrate
the performance of the proposed methodology for CMDPs
on a vehicle swarm coordination problem [11], [32]. In this
application, autonomous vehicles (agents) explore a region
$F$, which can be partitioned into $n$ disjoint subregions (or
bins) $F_i$ for $i = 1, \ldots, n$ such that $F = \cup_i F_i$. We can
model the system as an MDP where the states of agents are
their bin locations and the actions of a vehicle are defined
by the possible transitions to neighboring bins. Each vehicle
collects rewards while traversing the area where, due to the
stochastic environment, transitions are stochastic (i.e., even if
the vehicle's command is to move "right", the environment
can send the vehicle "left"). Note that the state constraints
discussed in this paper can be interpreted as follows.
If a large number of vehicles are used, then the density
of vehicles evolves as a Markov chain. Since the physical
environment (capacity/size of bins) can impose constraints
on the number of vehicles in a given bin, the state (safety)
constraints on the density are needed.

For simplicity we consider the operational region to be a 3
by 3 grid. Each vehicle has 5 possible actions, "up", "down",
"left", "right", and "stay"; see Figure 1. When the vehicle
is on the boundary, we set the probability of actions that
cause transition outside of the domain to zero. For example,
in bin 1 the actions "left" and "up" are not permitted, which
can easily be imposed as linear equality constraints in our
formulation.

The reward vectors $R_t$ for $t = 1, \ldots, N-1$ and $R_N$ are

$$R_t = [\,1\ \ 1\ \ 1\ \ 10\ \ 5\ \ 0\ \ 3\ \ 3\ \ 3\,]^T \quad \text{and} \quad R_N = [\,0\ \ 0\ \ 0\ \ 10\ \ 0\ \ 0\ \ 0\ \ 0\ \ 0\,]^T, \qquad (6)$$

where $R[i]$ is the reward collected at bin (state) $i$ and is
assumed independent of the action taken. Density constraints
for different bins are given as follows

$$d = [\,0.4\ \ 0.4\ \ 0.4\ \ 0.5\ \ 0.05\ \ 1\ \ 0.2\ \ 0.2\ \ 0.2\,]^T, \qquad (7)$$

where any bin $i$ should have $x_t[i] \le d[i]$ for $t = 1, 2, \ldots$.
The MDP solution (which is known to give deterministic
policies) that maximizes the total expected reward does
not satisfy these constraints. However, with our proposed
policy, not only are the constraints satisfied, but also the
solution gives guarantees on the expected total reward. Note
that the linear program generates the policies independent
of the initial distribution. Therefore, even if the latter was
unknown (which is usually the case in autonomous swarms),
the generated policy satisfies the constraints.

Fig. 1. Illustration of the 3 x 3 grid describing the MDP states, and the 5
actions (Up, Down, Left, Right, and Stay).

Fig. 2. The figure shows the density of autonomous vehicles and how the
policy for unconstrained MDP can violate the constraints. The density for
bin 4 jumped all the way to 1 after 5 iterations while its maximum capacity
is 0.5. The synthesized CMDP policy obeys the constraints while giving a
lower bound guarantee on the expected reward.

We now consider that all the vehicles initially are in bin
6 (i.e., $x_1[6] = 1$; note that this is a feasible starting
vector because $d[6] = 1$). Figure 2 shows that in the scenario
considered in this simulation, the unconstrained MDP policy
would lead all the swarm vehicles to one bin and it would
violate the constraint because the maximum allowed density
is 0.5. The constraints are also violated at bin 5, as the
optimal policy made the swarm traverse this bin, leading to
a density of 0.8 where the maximum allowed density was
0.05. However, our policy generated from Algorithm 3 led
to a distribution of the swarm such that the constraints
are satisfied at every iteration. To further investigate the
efficiency of the algorithm we have to study the rewards
associated with the proposed policy.

In Figure 3 we show the reward of the constrained MDP
policy and we compare it to the infeasible policy of the
unconstrained MDP. It turns out that in this scenario, the added
heuristic generated by the proposed methods in this paper
could achieve a reward closer to the maximum possible reward
without constraints. The constrained MDP curve in crossed
line (yellow) is the lower bound derived by Theorem 1, which
provides optimality guarantees for the LP-generated policy.

Fig. 3. The curve corresponding to the unconstrained MDP is the total
expected reward for the optimal MDP policy without considering the state
constraints. Of course the policy is infeasible and cannot be used when
constraints are present. The constrained MDP is the reward corresponding
to the policy computed by the linear program (it is the computed lower
bound on the total expected reward). The constrained MDP plus heuristic
is the further enhancement obtained by projecting the optimal deterministic
MDP on the set of feasible policies for CMDP with reward guarantees.

VII. Conclusion 

In this paper, we have studied finite-state finite-horizon
MDP problems with state constraints. It is shown that policies
due to unconstrained MDP algorithms are not feasible,
and we propose an efficient algorithm based on linear
programming and duality theory to generate feasible Markovian
policies that not only satisfy the constraints, but also provide
some guarantees on the expected reward. This new policy
defines a probability distribution over possible actions and
requires that agents randomize their actions depending on
the state. In the absence of constraints, the proposed method
retrieves the optimal standard MDP policies. For future
work, we would like to extend the proposed policy to the
infinite-horizon case using an algorithm similar to the "value
iteration" of standard MDP problems.